In this session, we delve into the practicalities of importing and processing data from various sources using R, a crucial skill for any data analyst. Our exploration starts with making simple API calls to fetch data, providing a foundation for understanding the structure and format of JSON and XML data. We will then advance to more complex API interactions and data processing techniques, equipping you with the skills to handle real-world data extraction tasks efficiently. Additionally, we’ll cover methods for extracting data from PDF files, an essential capability for dealing with unstructured data. The session ends with a comprehensive case study utilizing World Bank data, where we’ll apply our learned techniques to analyze and report on global economic indicators.
Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the ADVANCED ANALYST - Foundations for Advanced Data Analytics in R DataKolektiv training.
Until now, we have used only data stored as .csv or
.xlsx files. As we have seen, it was possible to map such
files 1:1 onto dataframes in R: columns were defined, with the first
row holding column names by convention… However, in Data Science, we
often face situations in which we have to obtain data from a
source that does not necessarily deliver “tabular” data
structures. We will also often face data structures that are
essentially not “tabular”: for example, data structures in which some
elements contain certain fields which other elements do not - and not
because there are missing data, but because sometimes it does not make
sense for something to be described by a certain attribute at all.
Imagine describing David
Bowie, a famous English musician, by a data structure: we can know
his date of birth, and the release dates of his albums perhaps, but
should we place all that data into one single column? No, because the
semantics of such a field would be weird, of course. How would we
organize the rows in a dataframe describing David Bowie: the first row
describes the person, while the other rows describe his works of art? So, we
have a column for dateOfBirth, and that column has a value
in the first row of the dataframe only, and then we have a column for
releaseDate, and that column holds NA in the
first row and then a timestamp in all other rows that refer to his
albums? Wait, what about his spouse, children, collaborators: assign a
row in a dataframe to each one?
No. Of course, a list in R would do, correct?
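A quick sketch of such a list (the field names and the two albums below are merely illustrative, not taken from any dataset):

```r
# - a nested list: each element carries only the fields
# - that make sense for it
davidBowie <- list(
  name = "David Bowie",
  dateOfBirth = as.Date("1947-01-08"),
  albums = list(
    list(title = "Hunky Dory", releaseDate = as.Date("1971-12-17")),
    list(title = "Low", releaseDate = as.Date("1977-01-14"))
  )
)
# - navigate the nested structure with $ and [[ ]]
davidBowie$albums[[2]]$title
# [1] "Low"
```

No forced NA columns, no semantically weird fields: the structure follows the data.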
In the following example we will access a free REST API from within our R environment, collect the API response as JSON, convert it to an R list, and play with the data.
In this example we will rely on the free https://datausa.io/ API to obtain statistical data. Here is the intro to their API: datausa.io API.
baseEndPoint <- "https://datausa.io/api/data"
We will use {httr} to get in touch with the API. It is installed alongside the {tidyverse} packages, but it needs to be loaded explicitly.
library(httr)
Step 1. Define API parameters.
First we define the API parameters.
### --- compose API call
# - use base API endpoint
# - and concatenate with API parameters
# - from the following example: https://datausa.io/about/api/
# - parameter: drilldowns
drilldowns <- paste0("drilldowns=", "Nation")
# - parameter: measures
measures <- paste0("measures=", "Population")
# - parameters:
params <- paste("&", c(drilldowns, measures),
sep = "", collapse = "")
cat(params)
&drilldowns=Nation&measures=Population
Step 2. Compose API call.
We put together the baseEndPoint with the API call
parameters:
api_call <- paste0(baseEndPoint, "?", params)
cat(api_call)
https://datausa.io/api/data?&drilldowns=Nation&measures=Population
Step 3. Make API call.
We use httr::GET() to contact the API, ask for data, and
fetch the result:
response <- httr::GET(URLencode(api_call))
class(response)
[1] "response"
The URLencode(api_call) call to the base R
URLencode() function will take care of percent-encoding
where and if necessary. Hint: always use
URLencode(your_api_call).
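A minimal illustration of what URLencode() does (the URL below is made up for the example):

```r
# - spaces are not allowed in URLs and are percent-encoded as %20;
# - characters that are legal in URLs are left alone
URLencode("https://example.com/api?measure=Total Population")
# [1] "https://example.com/api?measure=Total%20Population"
```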
We can see that response is now of a
response class. It is pretty structured and rich
indeed:
str(response)
List of 10
$ url : chr "https://datausa.io/api/data?&drilldowns=Nation&measures=Population"
$ status_code: int 200
$ headers :List of 25
..$ date : chr "Wed, 05 Jun 2024 17:08:47 GMT"
..$ content-type : chr "application/json; charset=utf-8"
..$ x-dns-prefetch-control : chr "off"
..$ strict-transport-security : chr "max-age=15552000; includeSubDomains"
..$ x-download-options : chr "noopen"
..$ x-content-type-options : chr "nosniff"
..$ x-xss-protection : chr "1; mode=block"
..$ content-language : chr "en"
..$ etag : chr "W/\"6e4-0ge41tM7RM4ctHp0uTGcx2UYtjI\""
..$ vary : chr "Accept-Encoding"
..$ content-encoding : chr "gzip"
..$ last-modified : chr "Wed, 05 Jun 2024 17:08:47 GMT"
..$ x-cache-status : chr "MISS"
..$ x-frame-options : chr "SAMEORIGIN"
..$ access-control-allow-origin : chr "*"
..$ access-control-allow-credentials: chr "true"
..$ access-control-allow-methods : chr "GET, POST, OPTIONS"
..$ access-control-allow-headers : chr "DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type"
..$ x-cache-key : chr "https://datausa.io/api/data?&drilldowns=Nation&measures=Population"
..$ cache-control : chr "max-age=1800"
..$ cf-cache-status : chr "MISS"
..$ report-to : chr "{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=XqNMZdwMfzWXlb8nEmyU1xVOiO9r2ID3TIk"| __truncated__
..$ nel : chr "{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}"
..$ server : chr "cloudflare"
..$ cf-ray : chr "88f1c4423ad74d8d-FRA"
..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ all_headers:List of 1
..$ :List of 3
.. ..$ status : int 200
.. ..$ version: chr "HTTP/2"
.. ..$ headers:List of 25
.. .. ..$ date : chr "Wed, 05 Jun 2024 17:08:47 GMT"
.. .. ..$ content-type : chr "application/json; charset=utf-8"
.. .. ..$ x-dns-prefetch-control : chr "off"
.. .. ..$ strict-transport-security : chr "max-age=15552000; includeSubDomains"
.. .. ..$ x-download-options : chr "noopen"
.. .. ..$ x-content-type-options : chr "nosniff"
.. .. ..$ x-xss-protection : chr "1; mode=block"
.. .. ..$ content-language : chr "en"
.. .. ..$ etag : chr "W/\"6e4-0ge41tM7RM4ctHp0uTGcx2UYtjI\""
.. .. ..$ vary : chr "Accept-Encoding"
.. .. ..$ content-encoding : chr "gzip"
.. .. ..$ last-modified : chr "Wed, 05 Jun 2024 17:08:47 GMT"
.. .. ..$ x-cache-status : chr "MISS"
.. .. ..$ x-frame-options : chr "SAMEORIGIN"
.. .. ..$ access-control-allow-origin : chr "*"
.. .. ..$ access-control-allow-credentials: chr "true"
.. .. ..$ access-control-allow-methods : chr "GET, POST, OPTIONS"
.. .. ..$ access-control-allow-headers : chr "DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type"
.. .. ..$ x-cache-key : chr "https://datausa.io/api/data?&drilldowns=Nation&measures=Population"
.. .. ..$ cache-control : chr "max-age=1800"
.. .. ..$ cf-cache-status : chr "MISS"
.. .. ..$ report-to : chr "{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=XqNMZdwMfzWXlb8nEmyU1xVOiO9r2ID3TIk"| __truncated__
.. .. ..$ nel : chr "{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}"
.. .. ..$ server : chr "cloudflare"
.. .. ..$ cf-ray : chr "88f1c4423ad74d8d-FRA"
.. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
$ cookies :'data.frame': 0 obs. of 7 variables:
..$ domain : logi(0)
..$ flag : logi(0)
..$ path : logi(0)
..$ secure : logi(0)
..$ expiration: 'POSIXct' num(0)
..$ name : logi(0)
..$ value : logi(0)
$ content : raw [1:1764] 7b 22 64 61 ...
$ date : POSIXct[1:1], format: "2024-06-05 17:08:47"
$ times : Named num [1:6] 0 0.0647 0.0942 0.1693 0.7356 ...
..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
$ request :List of 7
..$ method : chr "GET"
..$ url : chr "https://datausa.io/api/data?&drilldowns=Nation&measures=Population"
..$ headers : Named chr "application/json, text/xml, application/xml, */*"
.. ..- attr(*, "names")= chr "Accept"
..$ fields : NULL
..$ options :List of 2
.. ..$ useragent: chr "libcurl/8.6.0 r-curl/5.2.1 httr/1.4.7"
.. ..$ httpget : logi TRUE
..$ auth_token: NULL
..$ output : list()
.. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
..- attr(*, "class")= chr "request"
$ handle :Class 'curl_handle' <externalptr>
- attr(*, "class")= chr "response"
You need to check one thing: the server status response.
response$status_code
[1] 200
200 means that your request was processed successfully.
Familiarize yourself with server status codes
from the following source: HTTP
response status codes.
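A defensive pattern worth adopting: let {httr} check the status for you. A small sketch of the relevant helpers:

```r
library(httr)
# - http_error() is TRUE for 4xx/5xx status codes...
http_error(200L)
# [1] FALSE
http_error(404L)
# [1] TRUE
# - ...and it also accepts a response object directly:
# if (http_error(response)) stop("API call failed: ", status_code(response))
# - or let {httr} raise a well-formatted error for you:
# stop_for_status(response)
```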
The result is found in response$content, but…
class(response$content)
[1] "raw"
print(response$content[1:50])
[1] 7b 22 64 61 74 61 22 3a 5b 7b 22 49 44 20 4e 61 74 69 6f 6e 22 3a 22 30 31 30 30 30
[29] 55 53 22 2c 22 4e 61 74 69 6f 6e 22 3a 22 55 6e 69 74 65 64 20 53
What is raw? It means that your data arrived as
raw binary data, and they need to be decoded into an R
character class representation. It is easy:
resp <- rawToChar(response$content)
class(resp)
[1] "character"
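A tiny round-trip shows what the raw class holds - the bytes of the text:

```r
# - charToRaw() and rawToChar() are inverses
raw_bytes <- charToRaw("JSON")
raw_bytes
# [1] 4a 53 4f 4e
rawToChar(raw_bytes)
# [1] "JSON"
```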
Is resp lengthy?
nchar(resp)
[1] 1764
cat(resp)
{"data":[{"ID Nation":"01000US","Nation":"United States","ID Year":2022,"Year":"2022","Population":331097593,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2021,"Year":"2021","Population":329725481,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2020,"Year":"2020","Population":326569308,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2019,"Year":"2019","Population":324697795,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2018,"Year":"2018","Population":322903030,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2017,"Year":"2017","Population":321004407,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2016,"Year":"2016","Population":318558162,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2015,"Year":"2015","Population":316515021,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2014,"Year":"2014","Population":314107084,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2013,"Year":"2013","Population":311536594,"Slug Nation":"united-states"}],"source":[{"measures":["Population"],"annotations":{"source_name":"Census Bureau","source_description":"The American Community Survey (ACS) is conducted by the US Census and sent to a portion of the population every year.","dataset_name":"ACS 5-year Estimate","dataset_link":"http://www.census.gov/programs-surveys/acs/","table_id":"B01003","topic":"Diversity","subtopic":"Demographics"},"name":"acs_yg_total_population_5","substitutions":[]}]}
Now we can see that the API response is indeed JSON. To work with JSON in R, we need to convert it into a data structure that R knows - for example, a list.
JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write, and easy for machines to parse and generate. It is widely used for data transmission between a server and a web application, as well as for storing and exchanging data in various contexts, including APIs.
Human-readable: JSON is formatted in a way that is easy to understand for humans, making it ideal for data documentation and debugging.
Lightweight: It is a text format that is concise and easy to parse.
Language-independent: While derived from JavaScript, JSON is language-agnostic and can be used with most programming languages, including R.
JSON data is organized in two primary structures:

Objects: unordered collections of key/value pairs, enclosed in curly braces {}.
Example:
{
"name": "Alice",
"age": 30,
"isStudent": false
}

Arrays: ordered lists of values, enclosed in square brackets [].
Example:
[
"apple",
"banana",
"cherry"
]

In R, you can use the jsonlite package to work with JSON
data. The package provides functions to read JSON data from a file or
URL and convert it into R data frames or lists, and vice versa.
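A minimal sketch of both structures with {jsonlite}:

```r
library(jsonlite)
# - a JSON object becomes a named list (the values here have mixed types)
person <- fromJSON('{"name": "Alice", "age": 30, "isStudent": false}')
person$age
# [1] 30
# - a JSON array of strings becomes a character vector
fruit <- fromJSON('["apple", "banana", "cherry"]')
fruit[2]
# [1] "banana"
```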
print(response)
Response [https://datausa.io/api/data?&drilldowns=Nation&measures=Population]
Date: 2024-06-05 17:08
Status: 200
Content-Type: application/json; charset=utf-8
Size: 1.76 kB
print(resp)
[1] "{\"data\":[{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2022,\"Year\":\"2022\",\"Population\":331097593,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2021,\"Year\":\"2021\",\"Population\":329725481,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2020,\"Year\":\"2020\",\"Population\":326569308,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2019,\"Year\":\"2019\",\"Population\":324697795,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2018,\"Year\":\"2018\",\"Population\":322903030,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2017,\"Year\":\"2017\",\"Population\":321004407,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2016,\"Year\":\"2016\",\"Population\":318558162,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2015,\"Year\":\"2015\",\"Population\":316515021,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2014,\"Year\":\"2014\",\"Population\":314107084,\"Slug Nation\":\"united-states\"},{\"ID Nation\":\"01000US\",\"Nation\":\"United States\",\"ID Year\":2013,\"Year\":\"2013\",\"Population\":311536594,\"Slug Nation\":\"united-states\"}],\"source\":[{\"measures\":[\"Population\"],\"annotations\":{\"source_name\":\"Census Bureau\",\"source_description\":\"The American Community Survey (ACS) is conducted by the US Census and sent to a portion of the population every year.\",\"dataset_name\":\"ACS 5-year 
Estimate\",\"dataset_link\":\"http://www.census.gov/programs-surveys/acs/\",\"table_id\":\"B01003\",\"topic\":\"Diversity\",\"subtopic\":\"Demographics\"},\"name\":\"acs_yg_total_population_5\",\"substitutions\":[]}]}"
library(jsonlite)
data <- jsonlite::fromJSON(resp)
data is now a list:
data$source
data$data
json_data <- toJSON(data)
print(json_data)
{"data":[{"ID Nation":"01000US","Nation":"United States","ID Year":2022,"Year":"2022","Population":331097593,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2021,"Year":"2021","Population":329725481,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2020,"Year":"2020","Population":326569308,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2019,"Year":"2019","Population":324697795,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2018,"Year":"2018","Population":322903030,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2017,"Year":"2017","Population":321004407,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2016,"Year":"2016","Population":318558162,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2015,"Year":"2015","Population":316515021,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2014,"Year":"2014","Population":314107084,"Slug Nation":"united-states"},{"ID Nation":"01000US","Nation":"United States","ID Year":2013,"Year":"2013","Population":311536594,"Slug Nation":"united-states"}],"source":[{"measures":["Population"],"annotations":{"source_name":"Census Bureau","source_description":"The American Community Survey (ACS) is conducted by the US Census and sent to a portion of the population every year.","dataset_name":"ACS 5-year Estimate","dataset_link":"http://www.census.gov/programs-surveys/acs/","table_id":"B01003","topic":"Diversity","subtopic":"Demographics"},"name":"acs_yg_total_population_5","substitutions":[]}]}
# write(json_data, file = "data.json")
Understanding JSON is essential for advanced analytics, as it enables seamless integration with web APIs and efficient data handling in your R projects.
Let’s plot the time series of the US population over years then:
library(ggplot2)
library(ggrepel)
ggplot(data = data$data,
       aes(x = Year,
           y = Population,
           label = Population)) +
  geom_path(linewidth = .25, color = "blue", group = 1) +
  geom_point(size = 2, color = "blue") +
  geom_label_repel(size = 3) +
  ggtitle("US Population") +
  theme_bw() +
  theme(panel.border = element_blank()) +
  theme(plot.title = element_text(hjust = .5))
For each API that you want to use, you will need to read its documentation and learn about the parameters that you may pass to it.
I have taken this API call from https://datausa.io/profile/soc/education-legal-community-service-arts-media-occupations.
You can copy and paste the entire API call into your browser’s navigation bar to obtain the JSON response directly.
The data are on education, legal, community service, arts, & media occupations in the USA.
Make a call and check the server response status:
api_call <- paste0(baseEndPoint,
"?",
paste("PUMS Occupation=210000-270000",
"measure=Total Population,Total Population MOE Appx,Record Count",
"drilldowns=Wage Bin",
"Workforce Status=true",
"Record Count>=5",
sep = "&"))
response <- GET(URLencode(api_call))
response$status_code
[1] 200
Convert the raw response to character, parse the JSON into a list, and extract a data.frame:
response <- rawToChar(response$content)
response <- fromJSON(response)
data <- response$data
head(data)
Visualize with {ggplot2}:
data$`Wage Bin` <- factor(data$`Wage Bin`,
levels = unique(data$`Wage Bin`))
ggplot(data = data,
aes(x = Year,
y = log(`Total Population`),
color = `Wage Bin`,
fill = `Wage Bin`)) +
geom_path(linewidth = 1.5, group = 1) +
geom_point(size = 13) +
facet_wrap(~`Wage Bin`, ncol = 2) +
ggtitle("US: Education, legal, community service, arts, & media occupations") +
theme_bw() +
theme(panel.border = element_blank()) +
theme(plot.title = element_text(hjust = .5, size = 70)) +
theme(axis.text.x = element_text(angle = 90, size = 40)) +
theme(axis.title.x = element_text(size = 50)) +
theme(axis.text.y = element_text(size = 40)) +
theme(axis.title.y = element_text(size = 50)) +
theme(legend.text = element_text(size = 50)) +
theme(legend.title = element_text(size = 50)) +
theme(strip.text = element_text(size = 45)) +
theme(strip.background = element_blank()) +
theme(legend.position = "top")
Wikidata is a free, collaborative, multilingual knowledge base that stores structured data to support Wikipedia and other Wikimedia Foundation projects. It serves as a central repository for data, enabling easy access and reuse of information across various platforms and applications.
In the following examples we will be using the Wikibase API to obtain data from Wikidata, the world’s largest open knowledge base, which comprises structured information from Wikipedia and many other sources.
library(XML) # - parse XML format
We will contact the Wikibase API to obtain all data stored in Wikidata on David Bowie (who has a Q identifier of Q5383 in this knowledge base). We will ask the Wikibase API to use JSON to describe its response. Here is what the JSON response will look like: Wikibase API response.
query <-
'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5383&languages=en&format=json'
response <- GET(URLencode(query))
response <- rawToChar(response$content)
response <- fromJSON(response)
class(response)
[1] "list"
Now, Wikidata is very complex (and thus very
powerful as a descriptive system; after all, its goal is to be able to
describe just about anything that we can imagine, talk, and write
about), so what fromJSON() returns is a nasty, nasty,
nested list:
instanceOf_DavidBowie <- response$entities$Q5383$claims$P31
instanceOf_DavidBowie$mainsnak$datavalue$value
It is necessary to study the Wikidata DataModel carefully in order to be able to navigate the knowledge structures that it describes:
labelOf_DavidBowie <- response$entities$Q5383$labels$en$value
labelOf_DavidBowie
[1] "David Bowie"
However, once you do learn about Wikidata’s data model… Tens of millions of highly structured items and relations among them will become accessible to you. Order emerges from chaos in this case, I assure you. Besides JSON, there is XML (and many more, but we will focus on these two formats).
XML (eXtensible Markup Language) is a versatile text-based format used for representing structured data. It is designed to be both human-readable and machine-readable, making it a popular choice for data interchange between systems.
XML documents consist of elements enclosed in tags, with a root element that contains all other elements.
Example of a Simple XML Document:
<person>
<name>John Doe</name>
<age>30</age>
<email>john.doe@example.com</email>
</person>

Example with Nested Elements:
<bookstore>
<book>
<title>Effective R Programming</title>
<author>Jane Smith</author>
<price>29.99</price>
</book>
<book>
<title>Data Science with R</title>
<author>John Doe</author>
<price>39.99</price>
</book>
</bookstore>

Let’s get back to the Wikibase API and ask for the same data on David Bowie wrapped in an XML response:
query <-
'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5383&languages=en&format=xml'
response <- GET(URLencode(query))
response <- rawToChar(response$content)
response <- xmlParse(response)
response <- xmlToList(response)
Note: format=xml.
The English label for David Bowie in Wikidata:
response$entities$entity$labels$label
language value
"en" "David Bowie"
class(response$entities$entity$labels$label)
[1] "character"
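The same xmlParse()/xmlToList() workflow applies to the small bookstore document shown earlier:

```r
library(XML)
bookstore <- '<bookstore>
  <book>
    <title>Effective R Programming</title>
    <author>Jane Smith</author>
    <price>29.99</price>
  </book>
  <book>
    <title>Data Science with R</title>
    <author>John Doe</author>
    <price>39.99</price>
  </book>
</bookstore>'
books <- xmlToList(xmlParse(bookstore))
# - both <book> elements end up as list elements named "book"
names(books)
# [1] "book" "book"
books[[2]]$title
# [1] "Data Science with R"
```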
Now, David Bowie, in all languages available in Wikidata. First, we get the data.
query <-
'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q5383&format=json'
response <- GET(URLencode(query))
response <- rawToChar(response$content)
response <- fromJSON(response)
Second: study the structure of the response, and then
lapply() across the appropriate set of lists:
labels <- lapply(response$entities$Q5383$labels, function(x) {
paste0(x$value, " (", x$language, ")")
})
labels <- paste(labels, collapse = ", ")
print(labels)
[1] "David Bowie (fr), David Bowie (de), David Bowie (en-ca), David Bowie (en-gb), ديفيد بوي (ar), Devid Boui (az), Дэвід Боўі (be), Дейвид Боуи (bg), David Bowie (br), David Bowie (bs), David Bowie (ca), David Bowie (co), David Bowie (cs), David Bowie (cy), David Bowie (da), David Bowie (diq), Ντέιβιντ Μπόουι (el), David Bowie (eo), David Bowie (es), David Bowie (et), David Bowie (eu), دیوید بویی (fa), David Bowie (fi), David Bowie (ga), David Bowie (gl), דייוויד בואי (he), डेविड बोवी (hi), David Bowie (hr), David Bowie (hu), David Bowie (id), David Bowie (io), David Bowie (is), David Bowie (it), デヴィッド・ボウイ (ja), David Bowie (jv), დევიდ ბოუი (ka), 데이비드 보위 (ko), David Bowie (li), David Bowie (lt), Deivids Bovijs (lv), Дејвид Боуви (mk), David Bowie (nl), David Bowie (nn), David Bowie (oc), David Bowie (pl), David Bowie (pms), David Bowie (pt), David Bowie (pt-br), David Bowie (ro), Дэвид Боуи (ru), David Bowie (scn), David Bowie (sh), David Bowie (sk), David Bowie (sl), David Bowie (sq), Дејвид Боуи (sr), David Bowie (sv), డేవిడ్ బౌవీ (te), เดวิด โบอี (th), David Bowie (tr), Девід Бові (uk), David Bowie (uz), David Bowie (vi), David Bowie (vls), 大卫·鲍伊 (zh), 大衛寶兒 (yue), David Bowie (de-ch), David Bowie (qu), David Bowie (la), Дэйвід Боўі (be-tarask), David Bowie (nb), David Bowie (mg), Դեյվիդ Բոուի (hy), David Bowie (ast), David Bowie (sco), David Bowie (lb), Дэвид Боуи (kk), David Bowie (ia), David Bowie (sd), David Bowie (gsw), डेविड बोवी (bho), David Bowie (fo), David Bowie (hsb), ਡੇਵਿਡ ਬੋਵੀ (pa), David Bowie (szl), David Bowie (af), David Bowie (an), David Bowie (bar), David Bowie (de-at), David Bowie (frp), David Bowie (fur), David Bowie (gd), David Bowie (ie), David Bowie (kg), David Bowie (lij), David Bowie (min), David Bowie (ms), David Bowie (nap), David Bowie (nds), David Bowie (nds-nl), David Bowie (nrm), David Bowie (pcd), David Bowie (rm), David Bowie (sc), David Bowie (sr-el), David Bowie (sw), David Bowie (vec), David Bowie (vo), David Bowie (wa), 
David Bowie (wo), David Bowie (zu), دیوید بویی (azb), ডেভিড বোয়ি (bn), David Bowie (eml), David Bowie (bcl), دەیڤد بویی (ckb), David Bowie (fy), David Bowie (lmo), David Bowie (nah), Devid Bowi (tt), Боуи Дэвид (cv), ഡേവിഡ് ബോയി (ml), David Bowie (nan), David Bowie (war), David Bowie (ilo), Дэвид Боуи (ba), 大衛寶兒 (zh-hk), 大卫·鲍伊 (zh-cn), 大卫·鲍伊 (zh-hans), 大衛·鮑伊 (zh-hant), 大衛·鮑伊 (zh-mo), 大卫·鲍伊 (zh-my), 大卫·鲍威 (zh-sg), 大衛·鮑伊 (zh-tw), დევიდ ბოუი (xmf), David Bowie (ext), ديفيد باوى (arz), David Bowie (lfn), டேவிட் போவி (ta), ڈیوڈ بوئی (ur), David Bowie (en), 大卫·鲍伊 (wuu), David Bowie (vro), ዴቪድ ቦሊ (am), Տէյվիտ Պոուի (hyw), David Bowie (kl), David Bowie (tl), David Bowie (smn), David Bowie (dag), David Bowie (kw), David Bowie (ku), David Bowie (tw), David Bowie (sje), David Bowie (jut), David Bowie (mos), David Bowie (ha), David Bowie (no), David Bowie (en-us)"
Didn’t I tell you how lists and functional programming are important in R?
Sooner or later you will want to extract tabular data from PDF files…
# Load the tabulapdf library
# Download the Microsoft Build of OpenJDK:
# https://learn.microsoft.com/en-us/java/openjdk/download
# Find your JAVA_HOME (to be exemplified in the session) and then:
# Sys.setenv(JAVA_HOME="C:/Program Files/Microsoft/jdk-11.0.23.9-hotspot")
library(tabulapdf)
# Define the path to the PDF file
data_dir <- paste0(getwd(), "/_data/")
pdf_path <- paste0(data_dir, "mtcars.pdf")
# Extract all tables from the PDF
tables <- tabulapdf::extract_tables(pdf_path)
extract_tables() returns a list of tables:
tables
Extract to data.frame, one by one:
mtcars <- tables[[1]]
iris <- tables[[2]]
bio_data <- tables[[3]]
# install.packages("WDI")
library(WDI)
WDIsearch('gdp')
data_set = WDI(indicator='NY.GDP.MKTP.CD',
country=c('US'),
start=1960, end=2022)
print(data_set)
You will need to learn about The World Bank Data in order to understand the indicators!
data_set = WDI(indicator='NY.GDP.MKTP.CD',
country=c('RS', 'HR', 'SI', 'ME', 'BA', 'MK'),
start=1990,
end=2023)
print(data_set)
ggplot(data_set,
aes(x = year,
y = NY.GDP.MKTP.CD,
color = country)) +
geom_line() +
xlab('Year') + ylab('GDP (current US$)') +
theme_bw() +
theme(panel.border = element_blank())
Your task is to provide an EDA, based on World Bank Data, of
how the ex-Yu countries developed after 1990. Use ggplot2 and
Plotly for visualizations. Study any World Bank Data indicators that you
might find interesting for the comparative study at hand, obtain them
via the WDI package, cross-tabulate them against each other if
necessary or interesting, and provide your insights in R Markdown.
First, have a look at this treasure: A list of over 1,000 datasets available in R packages…
Now, here is a list of some R packages specifically designed to
connect to third-party open-data resources and provide
data.frames:
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.
Goran S. Milovanović
DataKolektiv, 2024.
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.